Goal of this tutorial

This tutorial aims at showing how to use the gProfiler2 R package. It is more or less a Rmd version of the vignette. You can refer to the vignette and reference manual (links below) for more details.

g:GOSt - functional enrichment analysis of gene lists

Here is an example in Homo sapiens with various types of gene identifiers, a SNP id, chromosomal intervals and a term ID from the Gene Ontology. Parameters are described in the vignette. The result is a named list where “result” is a data.frame with the enrichment analysis results and “meta” contains a named list with all the metadata for the query.

# Example of gene list functional enrichment analysis with gost
gostres <- gost(query = c("X:1000:1000000", "rs17396340", "GO:0005005", "ENSG00000156103", "NLRP1"), 
                organism = "hsapiens", ordered_query = FALSE, 
                multi_query = FALSE, significant = TRUE, exclude_iea = FALSE, 
                measure_underrepresentation = FALSE, evcodes = FALSE, 
                user_threshold = 0.05, correction_method = "g_SCS", 
                domain_scope = "annotated", custom_bg = NULL, 
                numeric_ns = "", sources = NULL, as_short_link = FALSE)

names(gostres)
[1] "result" "meta"  
summary(gostres)
       Length Class      Mode
result 14     data.frame list
meta    5     -none-     list
head(gostres$result)
    query significant      p_value term_size query_size intersection_size precision     recall    term_id source                                             term_name effective_domain_size source_order                            parents
1 query_1        TRUE 2.813706e-30        88         22                16 0.7272727 0.18181818 GO:0048013  GO:BP                     ephrin receptor signaling pathway                 18092        14554                         GO:0007169
2 query_1        TRUE 1.077092e-21       285         22                16 0.7272727 0.05614035 GO:0007411  GO:BP                                         axon guidance                 18092         3305             GO:0007409, GO:0097485
3 query_1        TRUE 1.140559e-21       286         22                16 0.7272727 0.05594406 GO:0097485  GO:BP                            neuron projection guidance                 18092        21970 GO:0006928, GO:0006935, GO:0048812
4 query_1        TRUE 6.823529e-18       489         22                16 0.7272727 0.03271984 GO:0007409  GO:BP                                          axonogenesis                 18092         3304 GO:0048667, GO:0048812, GO:0061564
5 query_1        TRUE 2.809503e-17       534         22                16 0.7272727 0.02996255 GO:0061564  GO:BP                                      axon development                 18092        18331                         GO:0031175
6 query_1        TRUE 2.846480e-16       617         22                16 0.7272727 0.02593193 GO:0048667  GO:BP cell morphogenesis involved in neuron differentiation                 18092        15031             GO:0000904, GO:0048666
names(gostres$meta)
[1] "query_metadata"  "result_metadata" "genes_metadata"  "timestamp"       "version"        

With the parameter evcodes = TRUE, the result will include the evidence codes. In addition, a column “intersection” will appear showing the input gene IDs that intersect with the corresponding functional term. Note that this parameter can decrease the performance and make the query slower.

# Example with intersection (can be time consuming)
gostres2 <- gost(query = c("X:1000:1000000", "rs17396340", "GO:0005005", "ENSG00000156103", "NLRP1"), 
                organism = "hsapiens", ordered_query = FALSE, 
                multi_query = FALSE, significant = TRUE, exclude_iea = FALSE, 
                measure_underrepresentation = FALSE, evcodes = TRUE, 
                user_threshold = 0.05, correction_method = "g_SCS", 
                domain_scope = "annotated", custom_bg = NULL, 
                numeric_ns = "", sources = NULL)

head(gostres2$result)
    query significant      p_value term_size query_size intersection_size precision     recall    term_id source                                             term_name effective_domain_size source_order                            parents
1 query_1        TRUE 2.813706e-30        88         22                16 0.7272727 0.18181818 GO:0048013  GO:BP                     ephrin receptor signaling pathway                 18092        14554                         GO:0007169
2 query_1        TRUE 1.077092e-21       285         22                16 0.7272727 0.05614035 GO:0007411  GO:BP                                         axon guidance                 18092         3305             GO:0007409, GO:0097485
3 query_1        TRUE 1.140559e-21       286         22                16 0.7272727 0.05594406 GO:0097485  GO:BP                            neuron projection guidance                 18092        21970 GO:0006928, GO:0006935, GO:0048812
4 query_1        TRUE 6.823529e-18       489         22                16 0.7272727 0.03271984 GO:0007409  GO:BP                                          axonogenesis                 18092         3304 GO:0048667, GO:0048812, GO:0061564
5 query_1        TRUE 2.809503e-17       534         22                16 0.7272727 0.02996255 GO:0061564  GO:BP                                      axon development                 18092        18331                         GO:0031175
6 query_1        TRUE 2.846480e-16       617         22                16 0.7272727 0.02593193 GO:0048667  GO:BP cell morphogenesis involved in neuron differentiation                 18092        15031             GO:0000904, GO:0048666
                                                                                                                                                                       evidence_codes
1 IDA TAS IEA,ISS TAS IEA,TAS IEA,IDA IBA TAS,IGI TAS IEA,ISS TAS IEA,IDA TAS IEA,IDA TAS IEA,IGI ISS IBA TAS IEA,ISS TAS IEA,TAS,IDA TAS IEA,IDA TAS,TAS IEA,IDA TAS IEA,IBA TAS IEA
2                                                             IBA,ISS IBA IEA,IBA,IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA
3                                                             IBA,ISS IBA IEA,IBA,IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA
4                                                         IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA
5                                                     ISS IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA
6                                                         IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,ISS IBA IEA,IBA IEA,IBA,IBA,ISS IBA IEA,IBA,ISS IBA IEA,ISS IBA IEA,IBA,IBA,IBA
                                                                                                                                                                                                                                                     intersection
1 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
2 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
3 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
4 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
5 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364
6 ENSG00000044524,ENSG00000070886,ENSG00000080224,ENSG00000108947,ENSG00000116106,ENSG00000133216,ENSG00000135333,ENSG00000142627,ENSG00000143590,ENSG00000145242,ENSG00000146904,ENSG00000154928,ENSG00000182580,ENSG00000183317,ENSG00000196411,ENSG00000243364

The query results can also be gathered into a short-link to the g:Profiler web tool.

# Get a web link
gostres_link <- gost(query = c("X:1000:1000000", "rs17396340", "GO:0005005", "ENSG00000156103", "NLRP1"), 
                as_short_link = TRUE)
gostres_link
[1] "https://biit.cs.ut.ee/gplink/l/3Sjz6iFLT6"

The function gost also allows to perform enrichment on multiple input gene lists. If the parameter multiquery = TRUE is used, then the results from all of the input queries are grouped according to term IDs for better comparison.

# Multiple gene lists
multi_gostres1 <- gost(query = list("chromX" = c("X:1000:1000000", "rs17396340", 
                                                 "GO:0005005", "ENSG00000156103", "NLRP1"),
                             "chromY" = c("Y:1:10000000", "rs17396340", 
                                          "GO:0005005", "ENSG00000156103", "NLRP1")), 
                       multi_query = FALSE)

head(multi_gostres1$result)
   query significant      p_value term_size query_size intersection_size precision     recall    term_id source                                             term_name effective_domain_size source_order                            parents
1 chromX        TRUE 2.813706e-30        88         22                16 0.7272727 0.18181818 GO:0048013  GO:BP                     ephrin receptor signaling pathway                 18092        14554                         GO:0007169
2 chromX        TRUE 1.077092e-21       285         22                16 0.7272727 0.05614035 GO:0007411  GO:BP                                         axon guidance                 18092         3305             GO:0007409, GO:0097485
3 chromX        TRUE 1.140559e-21       286         22                16 0.7272727 0.05594406 GO:0097485  GO:BP                            neuron projection guidance                 18092        21970 GO:0006928, GO:0006935, GO:0048812
4 chromX        TRUE 6.823529e-18       489         22                16 0.7272727 0.03271984 GO:0007409  GO:BP                                          axonogenesis                 18092         3304 GO:0048667, GO:0048812, GO:0061564
5 chromX        TRUE 2.809503e-17       534         22                16 0.7272727 0.02996255 GO:0061564  GO:BP                                      axon development                 18092        18331                         GO:0031175
6 chromX        TRUE 2.846480e-16       617         22                16 0.7272727 0.02593193 GO:0048667  GO:BP cell morphogenesis involved in neuron differentiation                 18092        15031             GO:0000904, GO:0048666
multi_gostres2 <- gost(query = list("chromX" = c("X:1000:1000000", "rs17396340",
                                                 "GO:0005005", "ENSG00000156103", "NLRP1"),
                             "chromY" = c("Y:1:10000000", "rs17396340", 
                                          "GO:0005005", "ENSG00000156103", "NLRP1")), 
                       multi_query = TRUE)

head(multi_gostres2$result)
             term_id                   p_values significant term_size query_sizes intersection_sizes source                                               term_name effective_domain_size source_order                parents
1         GO:0005005 7.162616e-48, 4.090780e-44  TRUE, TRUE        16      23, 33             16, 16  GO:MF                  transmembrane-ephrin receptor activity                 18694         1548             GO:0005003
2         GO:0005003 6.933232e-45, 3.953788e-41  TRUE, TRUE        19      23, 33             16, 16  GO:MF                                ephrin receptor activity                 18694         1546             GO:0004714
3         GO:0004714 2.581144e-33, 1.439607e-29  TRUE, TRUE        63      23, 33             16, 16  GO:MF transmembrane receptor protein tyrosine kinase activity                 18694         1295 GO:0004713, GO:0019199
4 REAC:R-HSA-3928665 4.164481e-33, 1.836785e-32  TRUE, TRUE        49      20, 21             16, 16   REAC                  EPH-ephrin mediated repulsion of cells                 10531          747     REAC:R-HSA-2682334
5         GO:0019199 2.920627e-31, 1.613380e-27  TRUE, TRUE        82      23, 33             16, 16  GO:MF          transmembrane receptor protein kinase activity                 18694         4539 GO:0004672, GO:0004888
6         GO:0048013 2.813706e-30, 8.827603e-26  TRUE, TRUE        88      22, 34             16, 16  GO:BP                       ephrin receptor signaling pathway                 18092        14554             GO:0007169

The enrichment results can be visualized with a Manhattan-like-plot.

# Visualization
gostplot(gostres, capped = TRUE, interactive = TRUE)

If interactive = FALSE, then the function returns a static ggplot object.

The function publish_gostplot takes the static plot object as an input and enables to highlight a selection of interesting terms from the results with numbers and table of results.

# Static ggplot object
p <- gostplot(gostres, capped = FALSE, interactive = FALSE)

# Highlight terms
pp <- publish_gostplot(p, highlight_terms = c("GO:0048013", "REAC:R-HSA-3928663"), 
                       width = NA, height = NA, filename = NULL )

The gost results can also be visualized as a table.

# Table
publish_gosttable(gostres, highlight_terms = gostres$result[c(1:2,10,100:102,120,124,125),],
                        use_colors = TRUE, 
                        show_columns = c("source", "term_name", "term_size", "intersection_size"),
                        filename = NULL)

The same functions work also in case of multiquery results showing multiple Manhattan plots on top of each other and a multiple result table.

# Multiquery plot
gostplot(multi_gostres2, capped = TRUE, interactive = TRUE)
# Multiquery table
publish_gosttable(multi_gostres1, 
                         highlight_terms = multi_gostres1$result[c(1, 24, 82, 176, 204, 234),],
                        use_colors = TRUE, 
                        show_columns = c("source", "term_name", "term_size"),
                        filename = NULL)

In addition to the available GO, KEGG, etc data sources, users can upload their own custom data source using the Gene Matrix Transposed file format (GMT). The users can compose the files themselves or use pre-compiled gene sets from available dedicated websites like Molecular Signatures Database (MSigDB), etc.

Other tools

GProfiler2 also has a tool to convert gene IDs between databases, etc. It is called g:Convert.

# Gene identifier conversion with gconvert
gconvert(query = c("REAC:R-HSA-3928664", "rs17396340", "NLRP1"), organism = "hsapiens", 
         target="ENSG", mthreshold = Inf, filter_na = TRUE)
   input_number              input target_number          target    name                                                                       description                           namespace
1             1 REAC:R-HSA-3928664           1.1 ENSG00000010810     FYN FYN proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:4037]                                REAC
2             1 REAC:R-HSA-3928664          1.10 ENSG00000133216   EPHB2                                EPH receptor B2 [Source:HGNC Symbol;Acc:HGNC:3393]                                REAC
3             1 REAC:R-HSA-3928664          1.11 ENSG00000136238    RAC1                      Rac family small GTPase 1 [Source:HGNC Symbol;Acc:HGNC:9801]                                REAC
4             1 REAC:R-HSA-3928664          1.12 ENSG00000137575   SDCBP                      syndecan binding protein [Source:HGNC Symbol;Acc:HGNC:10662]                                REAC
5             1 REAC:R-HSA-3928664          1.13 ENSG00000149269    PAK1                  p21 (RAC1) activated kinase 1 [Source:HGNC Symbol;Acc:HGNC:8590]                                REAC
6             1 REAC:R-HSA-3928664          1.14 ENSG00000154928   EPHB1                                EPH receptor B1 [Source:HGNC Symbol;Acc:HGNC:3392]                                REAC
7             1 REAC:R-HSA-3928664          1.15 ENSG00000180370    PAK2                  p21 (RAC1) activated kinase 2 [Source:HGNC Symbol;Acc:HGNC:8591]                                REAC
8             1 REAC:R-HSA-3928664          1.16 ENSG00000182580   EPHB3                                EPH receptor B3 [Source:HGNC Symbol;Acc:HGNC:3394]                                REAC
9             1 REAC:R-HSA-3928664          1.17 ENSG00000196411   EPHB4                                EPH receptor B4 [Source:HGNC Symbol;Acc:HGNC:3395]                                REAC
10            1 REAC:R-HSA-3928664           1.2 ENSG00000071051    NCK2                          NCK adaptor protein 2 [Source:HGNC Symbol;Acc:HGNC:7665]                                REAC
11            1 REAC:R-HSA-3928664           1.3 ENSG00000077264    PAK3                  p21 (RAC1) activated kinase 3 [Source:HGNC Symbol;Acc:HGNC:8592]                                REAC
12            1 REAC:R-HSA-3928664           1.4 ENSG00000090776   EFNB1                                      ephrin B1 [Source:HGNC Symbol;Acc:HGNC:3226]                                REAC
13            1 REAC:R-HSA-3928664           1.5 ENSG00000101608  MYL12A                        myosin light chain 12A [Source:HGNC Symbol;Acc:HGNC:16701]                                REAC
14            1 REAC:R-HSA-3928664           1.6 ENSG00000102606 ARHGEF7      Rho guanine nucleotide exchange factor 7 [Source:HGNC Symbol;Acc:HGNC:15607]                                REAC
15            1 REAC:R-HSA-3928664           1.7 ENSG00000108262    GIT1                                   GIT ArfGAP 1 [Source:HGNC Symbol;Acc:HGNC:4272]                                REAC
16            1 REAC:R-HSA-3928664           1.8 ENSG00000108947   EFNB3                                      ephrin B3 [Source:HGNC Symbol;Acc:HGNC:3228]                                REAC
17            1 REAC:R-HSA-3928664           1.9 ENSG00000125266   EFNB2                                      ephrin B2 [Source:HGNC Symbol;Acc:HGNC:3227]                                REAC
18            2         rs17396340           2.1 ENSG00000054523   KIF1B                      kinesin family member 1B [Source:HGNC Symbol;Acc:HGNC:16636]                                    
19            3              NLRP1           3.1 ENSG00000091592   NLRP1          NLR family pyrin domain containing 1 [Source:HGNC Symbol;Acc:HGNC:14374] ENTREZGENE,HGNC,UNIPROT_GN,WIKIGENE

GProfiler2 also has a tool to map gene IDs between species. It is called g:Orth.

# Mapping homologous genes across related organisms with gorth
gorth(query = c("Klf4", "Sox2", "71950"), source_organism = "mmusculus", 
      target_organism = "hsapiens", mthreshold = Inf, filter_na = TRUE,
      numeric_ns = "ENTREZGENE_ACC")
  input_number input         input_ensg ensg_number ortholog_name   ortholog_ensg                                                        description
1            1  Klf4 ENSMUSG00000003032       1.1.1          KLF4 ENSG00000136826           Kruppel like factor 4 [Source:HGNC Symbol;Acc:HGNC:6348]
2            2  Sox2 ENSMUSG00000074637       2.1.1          SOX2 ENSG00000181449 SRY-box transcription factor 2 [Source:HGNC Symbol;Acc:HGNC:11195]
3            3 71950 ENSMUSG00000012396       3.1.1         NANOG ENSG00000111704                 Nanog homeobox [Source:HGNC Symbol;Acc:HGNC:20857]
4            3 71950 ENSMUSG00000012396       3.1.2       NANOGP8 ENSG00000255192    Nanog homeobox retrogene P8 [Source:HGNC Symbol;Acc:HGNC:23106]

Other gProfiler versions

It is possible to use older (archived) versions or the beta version of gProfiler.

# Accessing archived version (gprofiler2 package is only compatible with versions e94_eg41_p11 and higher)
set_base_url("http://biit.cs.ut.ee/gprofiler_archive3/e95_eg42_p13")

# Accessing the beta release
set_base_url("http://biit.cs.ut.ee/gprofiler_beta")

# Check the current version
get_base_url()
[1] "http://biit.cs.ut.ee/gprofiler_beta"

Save your session info

For the sake of traceability, store the specifications of your R environment in the report, with the command sessionInfo(). This will indicate the version of R as well as of all the libraries used in this notebook.

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.0.2/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gprofiler2_0.2.0 knitr_1.30      

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0  xfun_0.20         purrr_0.3.4       colorspace_2.0-0  vctrs_0.3.6       generics_0.1.0    htmltools_0.5.1.1 viridisLite_0.3.0 yaml_2.2.1        utf8_1.1.4        plotly_4.9.3      rlang_0.4.10      pillar_1.5.1      later_1.1.0.1     glue_1.4.2        DBI_1.1.1        
[17] lifecycle_1.0.0   stringr_1.4.0     munsell_0.5.0     gtable_0.3.0      htmlwidgets_1.5.3 evaluate_0.14     labeling_0.4.2    fastmap_1.1.0     httpuv_1.5.5      crosstalk_1.1.1   fansi_0.4.2       Rcpp_1.0.6        xtable_1.8-4      scales_1.1.1      promises_1.2.0.1  debugme_1.1.0    
[33] jsonlite_1.7.2    farver_2.1.0      mime_0.10         gridExtra_2.3     ggplot2_3.3.3     digest_0.6.27     stringi_1.5.3     dplyr_1.0.5       shiny_1.5.0       grid_4.0.2        tools_4.0.2       bitops_1.0-6      magrittr_2.0.1    lazyeval_0.2.2    RCurl_1.98-1.2    tibble_3.1.0     
[49] crayon_1.4.1      tidyr_1.1.3       pkgconfig_2.0.3   ellipsis_0.3.1    data.table_1.14.0 assertthat_0.2.1  rmarkdown_2.5     httr_1.4.2        R6_2.5.0          compiler_4.0.2